With increasing scale, large language models demonstrate both quantitative improvements and new qualitative capabilities, especially as zero-shot learners, as exemplified by GPT-3. However, these results rely heavily on delicate prompt design and heavy computation. In this work, we explore whether strong zero-shot ability can be achieved at a smaller model scale without any external supervised data. To achieve this goal, we revisit masked language modeling and present a geometry-guided self-supervised learning method (Go-tuning for short) that uses a small amount of task-aware self-supervised data to further update language models. Experiments show that Go-tuning enables T5-small (80M) to achieve zero-shot results competitive with large language models such as T5-XL (3B). We also apply Go-tuning to multi-task settings and develop a multi-task model, mgo-T5 (250M), which reaches the average performance of OPT (175B) on 9 datasets.
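To make the idea of task-aware self-supervised updates concrete, here is a minimal, illustrative sketch of building masked-language-modeling pairs from unlabeled task-domain text for a T5-style model; the function name, masking ratio, and span-corruption format are assumptions, not Go-tuning's actual data pipeline.

```python
import random

def t5_span_corrupt(tokens, mask_ratio=0.15, seed=0):
    """Turn a token list into a (corrupted input, target) pair using T5-style
    sentinel tokens. Illustrative only; the ratio and format are assumptions."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inp, tgt, sentinel, prev_masked = [], [], 0, False
    for i, tok in enumerate(tokens):
        if i in positions:
            if not prev_masked:                       # start a new masked span
                inp.append(f"<extra_id_{sentinel}>")
                tgt.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            tgt.append(tok)
            prev_masked = True
        else:
            inp.append(tok)
            prev_masked = False
    return " ".join(inp), " ".join(tgt)

# A task-aware self-supervised example drawn from unlabeled task-domain text.
print(t5_span_corrupt("the movie was surprisingly good overall".split()))
```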
We introduce a power-of-two low-bit post-training quantization (PTQ) method for deep neural networks that meets hardware requirements and does not require long retraining. Power-of-two quantization can convert the multiplications introduced by quantization and dequantization into bit shifts, which are adopted by many efficient accelerators. However, power-of-two scale factors have fewer candidate values, which leads to more rounding or clipping errors. We propose a novel power-of-two PTQ framework, dubbed RAPQ, which dynamically adjusts the power-of-two scales of the whole network instead of statically determining them layer by layer. It can theoretically trade off the rounding error and clipping error of the whole network. Meanwhile, the reconstruction method in RAPQ is based on the BN information of each unit. Extensive experiments on ImageNet demonstrate the excellent performance of our proposed method. Without bells and whistles, RAPQ reaches 65% and 48% accuracy on ResNet-18 and MobileNetV2, respectively, with INT2 weights and INT4 activations. We are the first to propose a PTQ framework for the more constrained but hardware-friendly power-of-two quantization scheme and to show that it can achieve nearly the same accuracy as state-of-the-art PTQ methods. The code has been released.
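As a minimal sketch of the underlying idea (not RAPQ itself), the snippet below restricts the quantization scale to a power of two so that dequantization reduces to a bit shift; the per-tensor scale selection shown here is an assumption, whereas RAPQ searches scales jointly over the whole network to balance rounding and clipping errors.

```python
import numpy as np

def quantize_power_of_two(x, n_bits=4):
    """Uniform symmetric quantization with a power-of-two scale.
    Restricting scale to 2**k lets dequantization be a bit shift; the choice
    of k trades rounding error against clipping error."""
    qmax = 2 ** (n_bits - 1) - 1
    # Smallest power-of-two scale covering the tensor range (an assumption;
    # RAPQ instead adjusts the scales dynamically for the whole network).
    k = int(np.ceil(np.log2(np.abs(x).max() / qmax)))
    scale = 2.0 ** k
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale, k

w = np.random.randn(64, 64).astype(np.float32)
w_hat, k = quantize_power_of_two(w, n_bits=4)
print("shift exponent:", k, "max abs error:", np.abs(w - w_hat).max())
```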
In recent years, interest has arisen in using machine learning to improve the efficiency of automatic medical consultation and enhance patient experience. In this article, we propose two frameworks to support automatic medical consultation, namely doctor-patient dialogue understanding and task-oriented interaction. We create a new large medical dialogue dataset with multi-level fine-grained annotations and establish five independent tasks, including named entity recognition, dialogue act classification, symptom label inference, medical report generation, and diagnosis-oriented dialogue policy. We report a set of benchmark results for each task, which shows the usability of the dataset and sets a baseline for future studies. Both code and data are available from https://github.com/lemuria-wchen/imcs21.
As a well-known optimization framework, the Alternating Direction Method of Multipliers (ADMM) has achieved tremendous success in many classification and regression applications. Recently, it has attracted the attention of deep learning researchers and is considered a potential substitute for gradient descent (GD). However, as an emerging field, several challenges remain unsolved, including 1) the lack of global convergence guarantees, 2) slow convergence towards solutions, and 3) cubic time complexity with respect to the feature dimension. In this paper, we propose a novel optimization framework that solves the general neural network training problem via ADMM (dlADMM) and addresses these challenges simultaneously. Specifically, the parameters in each layer are updated backward and then forward, so that parameter information is exchanged efficiently across layers. When dlADMM is applied to specific architectures, the time complexity of the subproblems is reduced from cubic to quadratic through a dedicated algorithm design using quadratic approximation and backtracking techniques. Last but not least, we provide the first proof of convergence to a critical point for an ADMM-type method (dlADMM) under mild conditions. Experiments on seven benchmark datasets demonstrate the convergence, efficiency, and effectiveness of our proposed dlADMM algorithm.
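The quadratic-approximation-with-backtracking ingredient mentioned above can be illustrated in isolation; the following is a generic sketch of one such proximal-style step (not dlADMM's actual subproblem solver), with the update rule and backtracking factor chosen as assumptions.

```python
import numpy as np

def quadratic_backtracking_step(f, grad_f, x, rho=1.0, eta=2.0, max_iter=50):
    """One update using the quadratic upper bound
    f(y) <= f(x) + grad_f(x)^T (y - x) + (rho/2)||y - x||^2,
    increasing rho by eta until the bound holds (backtracking)."""
    g = grad_f(x)
    for _ in range(max_iter):
        y = x - g / rho                                    # minimizer of the model
        bound = f(x) + g @ (y - x) + 0.5 * rho * np.sum((y - x) ** 2)
        if f(y) <= bound:                                  # bound valid: accept
            return y, rho
        rho *= eta                                         # tighten model, retry
    return x, rho

# Toy use: one step on f(x) = 0.5 * ||A x - b||^2.
A, b = np.random.randn(20, 5), np.random.randn(20)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
x_new, rho = quadratic_backtracking_step(f, grad, np.zeros(5))
print(f(np.zeros(5)), "->", f(x_new))
```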
Graph convolutional networks (GCNs) have been successfully applied in many graph-based applications. However, training large-scale GCN models remains challenging: due to the node dependency and layer dependency of the GCN architecture, a huge amount of computation time and memory is required during training. In this paper, we propose a parallel and distributed GCN training algorithm based on the Alternating Direction Method of Multipliers (ADMM) to tackle both challenges simultaneously. We first split the GCN layers into independent blocks to achieve layer parallelism. Furthermore, node dependency is reduced by partitioning the graph into several dense communities, so that each of them can be trained in parallel by an agent. Finally, we provide solutions to all subproblems in the community-based ADMM algorithm. Preliminary results show that our proposed community-based ADMM training algorithm can lead to a threefold speedup while achieving the best performance compared with state-of-the-art methods.
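A hedged sketch of the community-partitioning step is shown below, using modularity-based clustering from networkx as a stand-in for the paper's partitioner; the ADMM coordination between agents is omitted, and the function name is an assumption.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def split_into_communities(graph):
    """Partition a graph into dense communities so that each block (its node
    set plus induced subgraph) can be trained in parallel by a separate agent,
    with ADMM coordinating the agents (coordination omitted here)."""
    communities = greedy_modularity_communities(graph)
    return [graph.subgraph(nodes).copy() for nodes in communities]

g = nx.karate_club_graph()
blocks = split_into_communities(g)
print([b.number_of_nodes() for b in blocks])  # one dense block per agent
```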
Fast and reliable connectivity is essential to improving situational awareness and operational efficiency for public-safety mission-critical (MC) users. In emergency or disaster scenarios, where existing cellular network coverage and capacity may not meet MC communication demands, deployable-network solutions such as cells on wheels/wings can be utilized swiftly to ensure reliable connectivity for MC users. In this paper, we consider a scenario in which a macro base station (BS) is destroyed by a natural disaster and an unmanned aerial vehicle carrying a base station (UAV-BS) is set up to provide temporary coverage for users in the disaster area. The UAV-BS is integrated into the mobile network using 5G integrated access and backhaul (IAB) technology. We propose a framework and signalling procedure for applying machine learning to this use case. A deep reinforcement learning algorithm is designed to jointly optimize the access and backhaul antenna tilts as well as the three-dimensional position of the UAV-BS, in order to best serve the on-ground MC users while maintaining a good backhaul connection. Our results show that the proposed algorithm can autonomously navigate and configure the UAV-BS to improve throughput and reduce the drop rate of MC users.
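To illustrate how such a joint objective might be expressed for an RL agent, here is a small, hypothetical reward sketch that rewards access throughput, penalizes dropped MC users, and gates everything on a usable backhaul link; all thresholds, weights, and the function name are assumptions and not the paper's reward design.

```python
def uav_bs_reward(user_throughputs_mbps, backhaul_snr_db,
                  min_backhaul_snr_db=0.0, drop_threshold_mbps=1.0,
                  drop_penalty=5.0):
    """Illustrative reward for an agent adjusting antenna tilts and the 3D
    position of a UAV-BS. All values here are assumptions for the sketch."""
    if backhaul_snr_db < min_backhaul_snr_db:
        # Backhaul lost: all users are effectively dropped.
        return -drop_penalty * len(user_throughputs_mbps)
    dropped = sum(1 for t in user_throughputs_mbps if t < drop_threshold_mbps)
    return sum(user_throughputs_mbps) - drop_penalty * dropped

print(uav_bs_reward([3.2, 0.4, 5.1], backhaul_snr_db=12.0))
```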
Diffusion models, a new generative modelling paradigm, have achieved great success in image, audio, and video generation. However, given the discrete categorical nature of text, it is not trivial to extend continuous diffusion models to natural language, and text diffusion models remain less studied. Sequence-to-sequence text generation is one of the essential natural language processing topics. In this work, we apply diffusion models to sequence-to-sequence text generation and explore whether the superior generation performance of diffusion models can transfer to the natural language domain. We propose SeqDiffuSeq, a text diffusion model for sequence-to-sequence generation. SeqDiffuSeq uses an encoder-decoder Transformer architecture to model the denoising function. To improve generation quality, SeqDiffuSeq combines the self-conditioning technique with a newly proposed adaptive noise schedule technique. The adaptive noise schedule distributes the difficulty of denoising evenly across time steps and assigns exclusive noise schedules to tokens at different positions. Experimental results show good performance on sequence-to-sequence generation in terms of text quality and inference time.
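The self-conditioning ingredient can be sketched generically: the denoiser also receives its previous estimate of the clean embedding (zeros on the first pass) as an extra input. The tiny model below is a stand-in with an assumed interface, not SeqDiffuSeq's encoder-decoder denoiser.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in denoiser: maps [z_t ; previous x0 estimate ; t] back to x0."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Linear(2 * dim + 1, dim)

    def forward(self, z_t, t, prev_x0=None):
        if prev_x0 is None:                       # first pass: no self-condition
            prev_x0 = torch.zeros_like(z_t)
        t_feat = t.float().unsqueeze(-1)          # broadcast timestep feature
        return self.net(torch.cat([z_t, prev_x0, t_feat], dim=-1))

model = TinyDenoiser(dim=16)
z_t = torch.randn(4, 10, 16)                      # noised sequence embeddings
t = torch.full((4, 10), 500)
x0_first = model(z_t, t)                          # pass 1: plain prediction
x0_second = model(z_t, t, prev_x0=x0_first.detach())  # pass 2: self-conditioned
```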
In this paper, we introduce a novel optimization algorithm for machine learning model training called Normalized Stochastic Gradient Descent (NSGD), inspired by Normalized Least Mean Squares (NLMS) from adaptive filtering. When training a high-complexity model on a large dataset, the learning rate is critically important, as a poor choice of optimizer parameters can lead to divergence. The algorithm updates the network weights using the stochastic gradient, but with $\ell_1$- and $\ell_2$-based normalizations of the learning rate parameter, similar to the NLMS algorithm. Our main difference from existing normalization methods is that we do not include the error term in the normalization process; we normalize the update term using the input vector to the neuron. Our experiments show that models can be trained to better accuracy under different initial settings using our optimization algorithm. We demonstrate the efficiency of our training algorithm using ResNet-20 and a toy neural network on different benchmark datasets with different initializations. NSGD improves the accuracy of ResNet-20 from 91.96\% to 92.20\% on the CIFAR-10 dataset.
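A minimal sketch of the normalized update described above is given below: the gradient step is scaled by a norm of the neuron's input vector (as in NLMS), without the error term entering the normalization. The exact combination of $\ell_1$/$\ell_2$ terms, the epsilon, and the function name are assumptions inferred from the abstract.

```python
import numpy as np

def nsgd_update(w, grad, x_in, lr=0.1, eps=1e-8, p=2):
    """Normalized SGD step for one neuron: divide the learning rate by the
    p-norm of the input vector x_in (NLMS-style), not by any error term.
    p=1 or p=2 selects l1/l2 normalization; exact form is an assumption."""
    norm = np.sum(np.abs(x_in) ** p) ** (1.0 / p)
    return w - lr * grad / (norm + eps)

# Toy single-neuron example: linear regression on one sample.
x = np.array([2.0, -1.0, 0.5])
w = np.zeros(3)
y = 1.0
err = w @ x - y
grad = err * x            # gradient of 0.5 * (w'x - y)^2 w.r.t. w
w = nsgd_update(w, grad, x, lr=0.5)
print(w)
```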
Language models built on the Transformer architecture have shown great performance in natural language processing. However, problems such as over-fitting and representation collapse still arise when fine-tuning pre-trained language models on downstream tasks. In this work, we propose HyPe, a simple yet effective fine-tuning technique that alleviates such problems by perturbing the hidden representations of Transformer layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformer layers convey more diverse and meaningful language information. Therefore, making Transformer layers more robust to hidden-representation perturbations can further benefit the fine-tuning of PLMs en bloc. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances the generalization of hidden representations from different layers. In addition, HyPe incurs negligible computational overhead and is better than, and compatible with, previous state-of-the-art fine-tuning techniques.
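As an illustrative sketch of perturbing hidden representations during fine-tuning (not HyPe's exact scheme), the snippet below registers forward hooks that add Gaussian noise to each Transformer layer's output while the model is in training mode; the noise type and scale are assumptions.

```python
import torch
import torch.nn as nn

def add_hidden_noise_hooks(model, layer_types=(nn.TransformerEncoderLayer,),
                           sigma=1e-4):
    """Register forward hooks that perturb the hidden states produced by each
    Transformer layer during fine-tuning. Noise type and scale are assumptions;
    remove the hooks (handle.remove()) before evaluation."""
    def hook(module, inputs, output):
        if module.training:
            return output + sigma * torch.randn_like(output)
        return output

    return [m.register_forward_hook(hook)
            for m in model.modules() if isinstance(m, layer_types)]

# Toy usage with a small encoder stack standing in for a PLM backbone.
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True), 2)
handles = add_hidden_noise_hooks(enc)
out = enc(torch.randn(2, 8, 32))   # hidden states are perturbed in training mode
```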
Detecting sarcasm and verbal irony in people's subjective statements is crucial to understanding their intended meanings, real sentiments, and positions in social scenarios. This paper describes the X-PuDu system that participated in SemEval-2022 Task 6, iSarcasmEval - Intended Sarcasm Detection in English and Arabic, which aims at detecting intended sarcasm in various natural language understanding settings. Our solution fine-tunes pre-trained language models, such as ERNIE-M and DeBERTa, in multilingual settings to recognize irony in Arabic and English texts. Our system ranked second out of 43 and ninth out of 32 in Task A: one-sentence detection in English and Arabic; fifth out of 22 in Task B: binary multi-label classification in English; and first out of 16 and fifth out of 13 in Task C: sentence-pair detection in English and Arabic.